Abstract

This paper presents the use of probabilistic 
class-based lexica for disambiguation in target- 
word selection. Our method employs minimal 
but precise contextual information for disam- 
biguation. That is, only information provided 
by the target-verb, enriched by the condensed 
information of a probabilistic class-based lexi- 
con, is used. Induction of classes and fine-tuning 
to verbal arguments is done in an unsupervised 
manner by EM-based clustering techniques. The 
method shows promising results in an evalua- 
tion on real-world translations. 

1 Introduction 

Disambiguation of lexical ambiguities in nat-
urally occurring free text is considered a hard
task for computational linguistics. For instance, 
word sense disambiguation is concerned with 
the problem of assigning sense labels to occur- 
rences of an ambiguous word. Resolving such 
ambiguities is useful in constraining semantic 
interpretation. A related task is target-word dis- 
ambiguation in machine translation. Here a de- 
cision has to be made which of a set of alterna- 
tive target-language words is the most appro- 
priate translation of a source-language word. A 
solution to this disambiguation problem is di- 
rectly applicable in a machine translation sys- 
tem which is able to propose the translation al- 
ternatives. A further problem is the resolution 
of attachment ambiguities in syntactic parsing. 
Here the decision of verb versus argument at- 
tachment of noun phrases, or the choice for verb 
phrase versus noun phrase attachment of prepo- 
sitional phrases can build upon a resolution of 
the related lexical ambiguities. 

Statistical approaches have been applied suc-
cessfully to these problems. The great ad-
vantage of statistical methods over symbolic-
linguistic methods has been deemed to be
their effective exploitation of minimal linguis- 
tic knowledge. However, the best performing 
statistical approaches to lexical ambiguity res- 
olution themselves rely on complex information 
sources such as "lemmas, inflected forms, parts 
of speech and arbitrary word classes [ . . . ] lo- 
cal and distant collocations, trigram sequences, 
and predicate argument association" (Yarowsky 
(1995), p. 190) or large context windows of up to
1000 neighboring words (Schütze, 1992). Unfor-
tunately, in many applications such information
is not readily available. For instance, in incre- 
mental machine translation, it may be desirable 
to decide for the most probable translation of 
the arguments of a verb with only the transla- 
tion of the verb as information source but no 
large window of surrounding translations avail- 
able. In parsing, the attachment of a nominal 
head may have to be resolved with only informa- 
tion about the semantic roles of the verb but no 
other predicate argument associations at hand. 

The aim of this paper is to use only minimal, 
but yet precise information for lexical ambigu- 
ity resolution. We will show that good results 
are obtainable by employing a simple and nat- 
ural look-up in a probabilistic class-labeled lex- 
icon for disambiguation. The lexicon provides a 
probability distribution on semantic selection- 
classes labeling the slots of verbal subcatego- 
rization frames. Induction of distributions on 
frames and class-labels is accomplished in an 
unsupervised manner by applying the EM algo- 
rithm. Disambiguation then is done by a simple 
look-up in the probabilistic lexicon. We restrict 
our attention to a definition of senses as alterna-
tive translations of source-words. Our approach
provides a very natural solution for such a 
target-language disambiguation task — look for 
the most frequent target-noun whose semantics
fits best with the semantics required by the
target-verb. We evaluated this simple method
on a large number of real-world translations and
got results comparable to related approaches
such as that of Dagan and Itai (1994), where
much more selectional information is used.

Figure 1: Class 19: "locative action". At the top are listed the 20 most probable nouns in the
$p_{LC}(n|19)$ distribution and their probabilities, and at left are the 30 most probable verbs in the
$p_{LC}(v|19)$ distribution. 19 is the class index. Those verb-noun pairs which were seen in the training
data appear with a dot in the class matrix. Verbs with suffix .as:s indicate the subject slot of an
active intransitive. Similarly, .aso:s denotes the subject slot of an active transitive, and .aso:o
denotes the object slot of an active transitive.

2 Lexicon Induction via EM-Based 
Clustering 

2.1 EM-Based Clustering 

For clustering, we used the method described 
in Rooth et al. (1999). There, classes are de-
rived from distributional data: a sample of
pairs of verbs and nouns, gathered by parsing 
an unannotated corpus and extracting the fillers 
of grammatical relations. The semantically 
smoothed probability of a pair (v, n) is calcu-
lated in a latent class (LC) model as $p_{LC}(v, n) =
\sum_{c \in C} p_{LC}(c, v, n)$. The joint distribution is de-
fined by $p_{LC}(c, v, n) = p_{LC}(c)\, p_{LC}(v|c)\, p_{LC}(n|c)$.
By construction, conditioning of v and n on each
other is solely made through the classes c. The
parameters $p_{LC}(c)$, $p_{LC}(v|c)$, $p_{LC}(n|c)$ are esti-
mated by a particularly simple version of the
EM algorithm for context-free models. Input to
our clustering algorithm was a training corpus
of 1,178,698 tokens (608,850 types) of verb-noun 
pairs participating in the grammatical relations 
of intransitive and transitive verbs and their 
subject- and object-fillers. Fig. 1 shows an in-
duced class from a model with 35 classes. In-
duced classes often have a basis in lexical se-
mantics; class 19 can be interpreted as locative,
involving location nouns "room", "area", and
"world" and verbs such as "enter" and "cross".

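The estimation step of Sect. 2.1 can be sketched in a few lines. The following is only an illustrative sketch under stated assumptions, not the implementation of Rooth et al. (1999): the function names (`em_latent_class`, `p_lc`, `log_likelihood`), the random initialization, the number of iterations, and the toy counts are all invented for this example.

```python
import math
import random
from collections import defaultdict


def em_latent_class(pairs, n_classes, n_iter=50, seed=0):
    """Estimate p(c), p(v|c), p(n|c) from {(verb, noun): count} by EM."""
    rng = random.Random(seed)
    verbs = sorted({v for v, _ in pairs})
    nouns = sorted({n for _, n in pairs})

    def random_dist(keys):
        # Strictly positive random start so that no pair has zero probability.
        weights = {k: rng.random() + 0.1 for k in keys}
        z = sum(weights.values())
        return {k: w / z for k, w in weights.items()}

    p_c = [1.0 / n_classes] * n_classes
    p_v = [random_dist(verbs) for _ in range(n_classes)]
    p_n = [random_dist(nouns) for _ in range(n_classes)]

    for _ in range(n_iter):
        cnt_c = [0.0] * n_classes
        cnt_v = [defaultdict(float) for _ in range(n_classes)]
        cnt_n = [defaultdict(float) for _ in range(n_classes)]
        # E-step: split each observed count over the classes by p(c|v,n).
        for (v, n), f in pairs.items():
            joint = [p_c[c] * p_v[c][v] * p_n[c][n] for c in range(n_classes)]
            z = sum(joint)
            for c in range(n_classes):
                g = f * joint[c] / z
                cnt_c[c] += g
                cnt_v[c][v] += g
                cnt_n[c][n] += g
        # M-step: relative frequencies of the expected counts.
        total = sum(cnt_c)
        for c in range(n_classes):
            denom = cnt_c[c] or 1.0  # guard against an emptied class
            p_c[c] = cnt_c[c] / total
            p_v[c] = {v: cnt_v[c][v] / denom for v in verbs}
            p_n[c] = {n: cnt_n[c][n] / denom for n in nouns}
    return p_c, p_v, p_n


def p_lc(v, n, model):
    """Class-smoothed probability p_LC(v, n) = sum_c p(c) p(v|c) p(n|c)."""
    p_c, p_v, p_n = model
    return sum(p_c[c] * p_v[c][v] * p_n[c][n] for c in range(len(p_c)))


def log_likelihood(pairs, model):
    """Log-likelihood of the observed pair counts under the model."""
    return sum(f * math.log(p_lc(v, n, model)) for (v, n), f in pairs.items())


# Invented toy counts, not data from the paper.
toy_pairs = {
    ("enter", "room"): 5, ("enter", "world"): 3,
    ("cross", "room"): 2, ("cross", "area"): 4,
    ("eat", "bread"): 5, ("eat", "soup"): 2,
    ("drink", "soup"): 3, ("drink", "wine"): 2,
}
model = em_latent_class(toy_pairs, n_classes=2)
# Smoothed probabilities of two pairs that never occur in the toy sample.
print(p_lc("enter", "area", model), p_lc("enter", "wine", model))
```

Because v and n are conditioned on each other only through the classes, an unseen pair such as (enter, area) still receives positive probability whenever some class gives weight to both words. Which classes EM actually finds depends on the starting point, so the printed values should be read as illustrative only.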
2.2 Probabilistic Labeling with Latent 
Classes using EM-estimation 

To induce latent classes for the object slot of a 
fixed transitive verb v, another statistical infer-
ence step was performed. Given a latent class
model $p_{LC}(\cdot)$ for verb-noun pairs, and a sample
$n_1, \ldots, n_M$ of objects for a fixed transitive verb,
we calculate the probability of an arbitrary ob-
ject noun $n \in N$ by $p(n) = \sum_{c \in C} p(c, n) =
\sum_{c \in C} p(c)\, p_{LC}(n|c)$. This fine-tuning of the class
parameters p(c) to the sample of objects for a 
fixed verb is formalized again as a simple in- 
stance of the EM algorithm. In an experiment 
with English data, we used a clustering model 
with 35 classes. From the maximum-probability
parses derived for the British National Cor-
pus with the head-lexicalized parser of Carroll
and Rooth (1998), we extracted frequency ta-
bles for transitive verb-noun pairs. These tables
were used to induce a small class-labeled lexicon 
(336 verbs). 
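The fine-tuning step of Sect. 2.2 can be read as EM for mixture weights with fixed components: the $p_{LC}(n|c)$ stay as estimated by the clustering model, and only p(c) is re-estimated on the objects of one verb. The sketch below follows that reading. The function names, the uniform starting point, and all numbers are assumptions made for illustration; the class-membership probability is taken to be $p(c|n) = p(c)\,p_{LC}(n|c) / \sum_{c'} p(c')\,p_{LC}(n|c')$.

```python
def fit_class_weights(object_counts, p_n_given_c, n_iter=100):
    """EM for the class weights p(c) of one verb; p_LC(n|c) is kept fixed."""
    k = len(p_n_given_c)
    p_c = [1.0 / k] * k
    for _ in range(n_iter):
        expected = [0.0] * k
        for n, f in object_counts.items():
            joint = [p_c[c] * p_n_given_c[c].get(n, 0.0) for c in range(k)]
            z = sum(joint)
            if z == 0.0:
                continue  # noun is outside the domain of every class
            for c in range(k):
                expected[c] += f * joint[c] / z
        total = sum(expected)
        p_c = [e / total for e in expected]
    return p_c


def class_membership(n, p_c, p_n_given_c):
    """Posterior p(c|n) under the fine-tuned class weights."""
    joint = [p_c[c] * p_n_given_c[c].get(n, 0.0) for c in range(len(p_c))]
    z = sum(joint)
    return [j / z if z else 0.0 for j in joint]


# Invented two-class model and invented object counts for one verb.
p_n_given_c = [
    {"border": 0.6, "street": 0.3, "force": 0.1},   # hypothetical class 0
    {"force": 0.6, "support": 0.3, "border": 0.1},  # hypothetical class 1
]
objects_of_verb = {"border": 8, "street": 4, "force": 1}

p_c = fit_class_weights(objects_of_verb, p_n_given_c)
print(p_c, class_membership("border", p_c, p_n_given_c))
```

In this toy setting nearly all probability mass moves to class 0, since that class explains the observed objects far better than class 1.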
(ID 160867) Es gibt einige alte Passvorschriften, die be-
sagen, dass man einen Pass haben muss, wenn man die
Grenze überschreitet. There are some old provisions re-
garding passports which state that people crossing the
{border/ frontier/ boundary/ limit/ periphery/
edge} should have their passport on them.

(ID 201946) Es gibt schliesslich keine Lösung ohne die
Mobilisierung der bürgerlichen Gesellschaft und die
Solidarität der Demokraten in der ganzen Welt. There
can be no solution, finally, unless civilian {company/
society/ companionship/ party/ associate} is
mobilized and solidarity demonstrated by democrats
throughout the world.

Figure 3: Examples for target-word ambiguities 



Figure 2: Estimated frequencies of the objects 
of the transitive verbs cross and mobilize 

Fig. 2 shows the topmost parts of the lexical
entries for the transitive verbs cross and mo-
bilize. Class 19 is the most probable class-label
for the object-slot of cross (probability 0.692);
the objects of mobilize belong with probability
0.386 to class 16, which is the most probable
class for this slot. Fig. 2 shows for each verb the
ten nouns n with highest estimated frequencies
$f_c(n) = f(n)\,p(c|n)$, where f(n) is the frequency
of n in the sample $n_1, \ldots, n_M$. For example, the
frequency of seeing mind as object of cross is 
estimated as 74.2 times, and the most frequent 
object of mobilize is estimated to be force. 

3 Disambiguation with Probabilistic 
Cluster-Based Lexicons 

In the following, we will describe the simple 
and natural lexicon look-up mechanism which 
is employed in our disambiguation approach. 
Consider Fig. 3 which shows two bilingual sen-
tences taken from our evaluation corpus (see
Sect. 4). The source-words and their corre-
sponding target-words are highlighted in bold
face. The correct translation of the source-noun
(e.g. Grenze) as determined by the actual trans-
lators is replaced by the set of alternative trans-
lations (e.g. { border, frontier, boundary, limit,
periphery, edge }) as proposed by the word-to-
word dictionary of Fig. 5 (see Sect. 4).

The problem to be solved is to find a correct 
translation of the source-word using only min-
imal contextual information. In our approach,
the decision between alternative target-nouns is
done by using only information provided by the 
governing target-verb. The key idea is to back 
up this minimal information with the condensed 
and precise information of a probabilistic class- 
based lexicon. The criterion for choosing an al- 
ternative target-noun is thus the best fit of the 
lexical and semantic information of the target- 
noun to the semantics of the argument-slot of 
the target-verb. This criterion is checked by a 
simple lexicon look-up where the target-noun 
with highest estimated class-based frequency is 
determined. Formally, choose the target-noun $\hat{n}$
(and a class $\hat{c}$) such that

$$f_{\hat{c}}(\hat{n}) = \max_{n \in N,\, c \in C} f_c(n)$$

where $f_c(n) = f(n)\,p(c|n)$ is the estimated fre-
quency of n in the sample of objects of a
fixed target-verb, $p(c|n)$ is the class-membership
probability of n in c as determined by the prob-
abilistic lexicon, and f(n) is the frequency of n
in the combined sample of objects and transla-
tion alternatives¹.
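The look-up rule can be illustrated with a small sketch. It rests on assumptions that go beyond the text: the "combined sample" is read here as the verb's object counts plus one extra count for each translation alternative, the class-membership probability is computed as $p(c|n) \propto p(c)\,p_{LC}(n|c)$, and all names and numbers are invented.

```python
def combined_frequency(object_counts, alternatives):
    """f(n): object counts plus one count per alternative (an assumption)."""
    freq = dict(object_counts)
    for n in alternatives:
        freq[n] = freq.get(n, 0) + 1
    return freq


def choose_target_noun(alternatives, freq, p_c, p_n_given_c):
    """Return (noun, class, f_c(n)) maximizing f_c(n) = f(n) * p(c|n)."""
    best = None
    for n in alternatives:
        joint = [p_c[c] * p_n_given_c[c].get(n, 0.0) for c in range(len(p_c))]
        z = sum(joint)
        if z == 0.0:
            continue  # noun is outside the domain of every class
        for c, j in enumerate(joint):
            score = freq.get(n, 0) * j / z
            if best is None or score > best[2]:
                best = (n, c, score)
    return best


# Invented lexicon entry for one target-verb with two hypothetical classes.
p_c = [0.7, 0.3]
p_n_given_c = [
    {"border": 0.4, "frontier": 0.2, "street": 0.3, "boundary": 0.1},
    {"limit": 0.5, "boundary": 0.2, "border": 0.05, "force": 0.25},
]
object_counts = {"border": 8, "street": 4, "frontier": 2}
alternatives = ["border", "frontier", "boundary", "limit"]

freq = combined_frequency(object_counts, alternatives)
best = choose_target_noun(alternatives, freq, p_c, p_n_given_c)
print(best)
```

With these toy numbers the pair of border and class 0 wins, because border is both frequent as an object of the verb and almost certainly a member of the verb's dominant class.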

Consider example ID 160867 from Fig. 3. The
ambiguity to be resolved concerns the direct ob-
jects of the verb cross whose lexical entry is
partly shown in Fig. 2. Class 19 and the noun
border is the pair yielding a higher estimated 
frequency than any other combination of a class 
and an alternative translation such as boundary. 
Similarly, for example ID 301946, the pair of the
target-noun society and class 6 gives the highest es-
timated frequency of the objects of mobilize.

1 Note that $p(\hat{c}) = \max_{c \in C} p(c)$ in most, but not all cases.

4 Evaluation 

We evaluated our resolution methods on a 
pseudo-disambiguation task similar to that used
in Rooth et al. (1999) for evaluating cluster-
ing models. We used a test set of 298 (v, n, n')
triples where (v, n) is chosen randomly from a 
test corpus of pairs, and n' is chosen randomly 
according to the marginal noun distribution for 
the test corpus. Precision was calculated as the
proportion of times the disambiguation method de-
cided for the non-random target noun ($\hat{n} = n$).
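The pseudo-disambiguation measure can be sketched as follows. This is an illustrative reading, not the evaluation code used for the paper: the scoring function is a stand-in for whichever model is being evaluated, ties are counted as errors by assumption, and the triples and counts are invented.

```python
def pseudo_disambiguation_precision(triples, score):
    """Share of (v, n, n') triples where the attested noun n outscores n'."""
    correct = sum(1 for v, n, n_prime in triples
                  if score(v, n) > score(v, n_prime))
    return correct / len(triples)


# Invented counts standing in for a trained model's score.
toy_counts = {("cross", "border"): 5, ("eat", "soup"): 3}


def count_score(v, n):
    return toy_counts.get((v, n), 0)


toy_triples = [
    ("cross", "border", "soup"),
    ("eat", "soup", "border"),
    ("cross", "soup", "border"),
]
precision = pseudo_disambiguation_precision(toy_triples, count_score)
print(precision)
```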

As shown in Fig. 4, we obtained 88 % pre-
cision for the class-based lexicon (ProbLex),
which is a gain of 9 % over the best cluster-
ing model and a gain of 15 % over the human
baseline².
Figure 4: Evaluation on pseudo-disambiguation
task for noun-ambiguity 



The results of the pseudo-disambiguation 
could be confirmed in a further evaluation on a 
large number of randomly selected examples of 
a real-world bilingual corpus. The corpus con-
sists of sentence-aligned debates of the Euro-
pean Parliament (MLCC = multilingual corpus
for cooperation) with ca. 9 million tokens for
German and English. From this corpus we pre-
pared a gold standard as follows. We gathered
word-to-word translations from online-available
dictionaries and eliminated German nouns for
which we could not find at least two English
translations in the MLCC corpus. The resulting
35 word dictionary is shown in Fig. 5. Based on
this dictionary, we extracted all bilingual sen- 
tence pairs from the corpus which included both 
the source-noun and the target-noun. We re-
stricted the resulting ca. 10,000 sentence pairs
to those which included a source-noun from this
dictionary in the object position of a verb. Fur-
thermore, the target-object was required to be
included in our dictionary and had to appear
in a similar verb-object position as the source-
object for an acceptable English translation of
the German verb. We marked the German noun
$n_g$ in the source-sentence, its English translation
$n_e$ as appearing in the corpus, and the English
lexical verb $v_e$. For the 35 word dictionary of
Fig. 5 this semi-automatic procedure resulted
in a test corpus of 1,340 examples. The aver-
age ambiguity in this test corpus is 8.63 trans-
lations per source-word. Furthermore, we took
the semantically most distant translations for 25
words which occurred with a certain frequency
in the evaluation corpus. This gave a corpus of
814 examples with an average ambiguity of 2.83
translations. The entries belonging to this dic-
tionary are highlighted in bold face in Fig. 5.
The dictionaries and the related test corpora are
available on the web.

2 Similar results for pseudo-disambiguation were
obtained for a simpler approach which avoids an-
other EM application for probabilistic class labeling.
Here $\hat{n}$ (and $\hat{c}$) was chosen such that $f_{\hat{c}}(v, \hat{n}) =
\max_{c,n} \big((f_{LC}(v, n) + 1)\, p_{LC}(c|v, n)\big)$. However, the sensitivity
to class-parameters was lost in this approach.

We believe that an evaluation on these test 
corpora is a realistic simulation of the hard task 
of target-language disambiguation in real-world
machine translation. The translation alterna- 
tives are selected from online dictionaries, cor- 
rect translations are determined as the actual 
translations found in the bilingual corpus, no 
examples are omitted, the average ambiguity is 
high, and the translations are often very close 
to each other. In contrast to this, most other
evaluations are based on frequent uses of only 
two clearly distant senses that were determined 
as interesting by the experimenters. 

Figure 5: Dictionaries extracted from online resources

Figure 6: Disambiguation results for clustering versus probabilistic lexicon methods

Fig. 6 shows the results of lexical ambigu-
ity resolution with probabilistic lexica in com-
parison to simpler methods. The rows show
the results for evaluations on the two corpora
with average ambiguity of 8.63 and 2.83 respec-
tively. Column 2 shows the percentage of correct
translations found by disambiguation by ran-
dom choice. Column 3 presents as another base-
line disambiguation with the major sense, i.e.,
always choose the most frequent target-noun
as translation of the source-noun. In column 4,
the empirical distribution of (v, n) pairs in the
training corpus extracted from the BNC is used
as disambiguator. Note that this method yields
good results in terms of precision (P = #correct
/ (#correct + #incorrect)), but is much worse in
terms of effectiveness (E = #correct / (#correct
+ #incorrect + #don't know)). The reason for
this is that even if the distribution of (v, n) pairs 
is estimated quite precisely for the pairs in the 
large training corpus, there are still many pairs 
which receive the same or no positive probabil- 
ity at all. These effects can be overcome by a 
clustering approach to disambiguation (column 
5). Here the class-smoothed probability of a 
(v, n) pair is used to decide between alternative 
target-nouns. Since the clustering model assigns 
a more fine-grained probability to nearly every 
pair in its domain, there are no don't know cases 
for comparable precision values. However, the 
semantically smoothed probability of the clus- 
tering models is still too coarse-grained when 
compared to a disambiguation with a proba-
bilistic lexicon. Here a further gain in preci-
sion and, equally, in effectiveness of ca. 7 % is ob-
tained on both corpora (column 6). We conjec-
ture that this gain can be attributed to the com-
bination of frequency information of the nouns
and the fine-tuned distribution on the selec-
tion classes of the nominal arguments of the
verbs. We believe that including the set of trans-
lation alternatives in the ProbLex distribution
is important for increasing efficiency, because it
gives the disambiguation model the opportunity
to choose among unseen alternatives. Further-
more, it seems that the higher precision of Prob-
Lex cannot be attributed to filling in zeroes
in the empirical distribution. Rather, we specu-
late that ProbLex intelligently filters the empir-
ical distribution by reducing maximal counts for
observations which do not fit into classes. This
might help in cases where the empirical distri- 
bution has equal values for two alternatives. 
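The two measures used in this comparison differ only in how undecided cases are treated, which a short sketch makes explicit. The function name and the counts below are invented for illustration and are not results from the evaluation.

```python
def precision_and_effectiveness(correct, incorrect, dont_know):
    """P ignores undecided cases; E counts them against the method."""
    decided = correct + incorrect
    total = decided + dont_know
    precision = correct / decided if decided else 0.0
    effectiveness = correct / total if total else 0.0
    return precision, effectiveness


# Invented counts: a method that abstains on 20 of 100 test items.
p, e = precision_and_effectiveness(correct=60, incorrect=20, dont_know=20)
print(p, e)
```

A method that never answers "don't know" has P = E, which is why the clustering and lexicon methods can be compared on either measure.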
Figure 7: Precision for finding correct and ac-
ceptable translations by lexicon look-up

Fig. 7 shows the results for disambiguation
with probabilistic lexica for five sample words 
with two translations each. For this dictionary, 
a test corpus of 219 sentences was extracted, 200 
of which were additionally labeled with accept- 
able translations. Precision is 78 % for finding 
correct translations and 90 % for finding accept- 
able translations. 

Furthermore, in a subset of 100 test items 
with average ambiguity 8.6, a human judge hav- 
ing access only to the English verb and the set 
of candidates for the target-noun, i.e. the in- 
formation used by the model, selected among 
translations. On this set, human precision was 
39 %. 

5 Discussion 

Fig. 8 shows a comparison of our approach
to state-of-the-art unsupervised algorithms for
word sense disambiguation. Column 2 shows the
number of test examples used to evaluate the
various approaches. The range is from ca. 100
examples to ca. 37,000 examples. Our method
was evaluated on test corpora of sizes 219, 814,
and 1,340. Column 3 gives the average number
of senses/translations for the different disam-
biguation methods. Here the range of the ambi-
guity rate is from 2 to about 9 senses⁴. Column
4 shows the random baselines cited for the re-
spective experiments, ranging from ca. 11 % to
50 %. Precision values are given in column 5. In
order to compare these results, which were com-
puted for different ambiguity factors, we stan-
dardized the measures to an evaluation for bi-
nary ambiguity. This is achieved by calculating
$p^{1/\log_2 amb}$ for precision p and ambiguity fac-
tor amb. The consistency of this "binarization"
can be seen by a standardization of the differ-
ent random baselines which yields a value of ca.
50 % for all approaches⁵. The standardized pre-
cision of our approach is ca. 79 % on all test
corpora. The most direct point of comparison
is the method of Dagan and Itai (1994) which
gives 91.4 % precision (92.7 % standardized)
and 62.1 % effectiveness (66.8 % standardized)
on 103 test examples for target word selection
in the transfer of Hebrew to English. However,
compensating this high precision measure for
the low effectiveness gives values comparable to
our results. Dagan and Itai's (1994) method is
based on a large variety of grammatical rela-
tions for verbal, nominal, and adjectival pred-
icates, but no class-based information or slot-
labeling is used. Resnik (1997) presented a dis-
ambiguation method which yields 44.3 % pre-
cision (63.8 % standardized) for a test set of
88 verb-object tokens. His approach is compa-
rable to ours in terms of informedness of the
disambiguator. He also uses a class-based selec-
tion measure, but based on WordNet classes.
However, the task of his evaluation was to se-
lect WordNet-senses for the objects rather than
the objects themselves, so the results cannot
be compared directly. The same is true for the
Senseval evaluation exercise (Kilgarriff and
Rosenzweig, 2000); there, word senses from the
HECTOR-dictionary had to be disambiguated.
The precision results for the ten unsupervised
systems taking part in the competitive evalu-
ation ranged from 20-65% at efficiency values
from 3-54%. The Senseval standard is clearly
beaten by the earlier results of Yarowsky (1995)
(96.5 % precision) and Schütze (1992) (92 %
precision). However, a comparison to these re-
sults is again somewhat difficult. Firstly, these
approaches were evaluated on words with two
clearly distant senses which were determined by
the experimenters. In contrast, our method was
evaluated on randomly selected actual transla-
tions of a large bilingual corpus. Furthermore,
these approaches use large amounts of informa-
tion in terms of linguistic categorizations, large
context windows, or even manual intervention
such as initial sense seeding (Yarowsky, 1995).
Such information is easily obtainable, e.g., in IR
applications, but often burdensome to gather or
simply unavailable in situations such as incre-
mental parsing or translation.

Figure 8: Comparison of unsupervised lexical disambiguation methods.

4 The ambiguity factor 2.27 attributed to Dagan and
Itai's (1994) experiment is calculated by dividing their
average of 3.27 alternative translations by their average
of 1.44 correct translations. Furthermore, we calculated
the ambiguity factor 3.51 for Resnik's (1997) experiment
from his random baseline 28.5 % by taking 100/28.5; con-
versely, Dagan and Itai's (1994) random baseline can be
calculated as 100/2.27 = 44.05. The ambiguity factor for
Senseval is calculated for the noun task in the English
Senseval test set.

5 Note that we are guaranteed to get exactly 50 %
standardized random baseline if random · amb = 100 %.
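The standardization to binary ambiguity used in this comparison can be checked numerically. The sketch below only restates the formula $p^{1/\log_2 amb}$ from the text with precision expressed as a fraction; the function name `standardize` is chosen here for illustration.

```python
import math


def standardize(p, amb):
    """Map precision p (as a fraction) at ambiguity amb to binary ambiguity."""
    return p ** (1.0 / math.log2(amb))


# A random baseline of 1/amb always standardizes to one half.
print(standardize(1 / 2.27, 2.27))
# Figures quoted in the text for Dagan and Itai (1994) and Resnik (1997).
print(round(standardize(0.914, 2.27), 3))
print(round(standardize(0.443, 3.51), 3))
```

The first value follows directly from the identity $(1/amb)^{1/\log_2 amb} = 2^{-1}$, which is the content of footnote 5. For amb = 2 the measure leaves precision unchanged.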

6 Conclusion 

The disambiguation method presented in this 
paper is deliberately restricted to the limited
amount of information provided by a proba-
bilistic class-based lexicon. This information
nevertheless proves accurate enough to yield good em-
pirical results, e.g., in target-language disam-
biguation. The probabilistic class-based lexica
are induced in an unsupervised manner from 
large unannotated corpora. Once the lexica are 
constructed, lexical ambiguity resolution can be 
done by a simple lexicon look-up. In target- 
word selection, the most frequent target-noun 
whose semantics fits best to the semantics of 
the argument-slot of the target-verb is cho-
sen. We evaluated our method on randomly se-
lected examples from real-world bilingual cor-
pora, which constitutes a realistic hard task. Dis-
ambiguation based on probabilistic lexica per-
formed satisfactorily for this task. The lesson
learned from our experimental results is that
hybrid models combining frequency information
and class-based probabilities outperform both 
pure frequency-based models and pure cluster- 
ing models. Further improvements are to be ex- 
pected from extended lexica including, e.g., ad- 
jectival and prepositional predicates. 